Statistics without spreadsheet

Tomasz Przechlewski

March, 2019

Who am I

My name is Tomasz Plata-Przechlewski and I live in Poland. I was born on 16 June 1963 (it was a Sunday, the exact day when Valentina Vladimirovna Tereshkova was launched into space – if you know who she is).

BTW in Poland born-on-Sunday means a work-shy (ie lazy) person (so now you know your first Polish proverb?)

BTW by pure statistics \(1/7 \approx 14\)% of the population is work-shy:-)

I graduated in economics a long time ago and have taught (mainly) statistics and information systems. I am a big fan of open source software (or OSS) and I know a few OSS systems including Linux and LaTeX. And of course R, which I am about to show you in a while.

My hobbies are road cycling and history. I am also an amateur photographer (cf tprzechlewski@flickr).

Agenda

Statistics (nothing spectacular, just classical EDA, no (heavy) math, relax)

Statistical software (modern, non-standard or hipster #youcall)

Poland (via statistical examples)

How statistics is badly taught (at Polish social-science departments)

Statistics (particularly in the social domain) is the most dangerous form of a lie. Why? Because of fanciful definitions, sloppy measures, poor samples, and erroneous computations.

Students are unaware of this pitiful state of affairs.

Three components of Statistics (statistical value chain 1-st version):

Theory (models) + Tools (programs) + Practice (real data)

Undergraduate courses on statistics concentrate on theory, and use some spreadsheet as a universal computing tool.

For students statistics is a lot of math formulas + Excel = difficult and boring

Students work with artificial (clean) and small data sets, so they are unaware of the problems related to applying theory in practice and of the data definition/collection stage.

A change of concept is urgently needed. Students should be aware of the true workflow of statistical analysis:

How statistics should be taught to students (in my opinion at least)

Office software has limits. Spreadsheets are good for number crunching, but are not so good at: data cleaning (Practice), advanced graphics, spatial analysis (Practical-Theory), team work (Practice).

Office editors or PowerPoint are great tools, but they are not meant for quality publishing of statistical results.

It is wrong to ignore the existence of modern open source tools and not introduce students to them. It is wrong not to introduce students to some (even elementary) programming and to stick exclusively to the point-and-click mode of work (ie spreadsheets).

I will try to demonstrate that using modern tools for statistical analysis is a feasible way to go, and that (some) modern tools are not much more difficult (let alone prohibitively difficult) than office software (at a higher than basic level of usage at least).

Conclusion: less theory, more practice and common sense. Show students the real ‘value chain’ of statistical analysis with all its problems (not covered nowadays):

Poor definitions: Full-Time Equivalent (FTE)

Number of students.

Who is a student?

A student is a person attending a 3rd-level school in the 3-stage education system (cf Educational_stage). The answer is still non-obvious as there are many forms of tertiary education. For example:

The UNESCO stated that tertiary education focuses on learning endeavors in specialized fields. It includes academic and higher vocational education.

So according to the above definition a school does not belong to tertiary education if its status is not academic and/or higher vocational. Examples: a Dance Academy or a University for Elderly People (aka University of the 3rd Age). Both are popular in Poland.

In many countries there is some certification scheme. For example in Poland a school must apply for (and get) a certificate to be regarded as a high school (ie a part of the tertiary level of education).

Heads vs Majors

A student can be enrolled in more than one course (major). So for counting heads it is necessary to remove duplicates, otherwise one would count majors, not persons.
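A minimal sketch in R of the difference between counting majors and counting heads. The enrolment register and its column names (student_id, major) are made up for illustration only:

```r
# Hypothetical enrolment register: one row per (student, major) pair
enrol <- data.frame(
  student_id = c(101, 102, 102, 103, 104, 104, 104),
  major      = c("Economics", "Economics", "History", "Law",
                 "Economics", "Law", "Statistics")
)

n_majors <- nrow(enrol)                       # counts enrolments (majors)
n_heads  <- length(unique(enrol$student_id))  # counts persons (heads)

c(majors = n_majors, heads = n_heads)
```

The two counts differ as soon as at least one student follows more than one major.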

Part time studies

FTE stands for Full-Time-Equivalent, an approximation of the number of students who would be enrolled full-time

Full time equivalent (FTE) – FTE is based on student credit hours. It is obtained by dividing the total student credit hours by the number of credit hours defined as full-time study.
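The FTE computation can be sketched as below. The credit hours and the full-time load of 15 hours are assumptions for the sake of the example, as the actual load is set by each institution:

```r
# Hypothetical credit hours taken by six students in one term
credit_hours   <- c(15, 15, 6, 9, 12, 3)
full_time_load <- 15   # assumed number of credit hours for full-time study

heads <- length(credit_hours)
fte   <- sum(credit_hours) / full_time_load

c(heads = heads, fte = fte)
```

With many part-time students the FTE figure is visibly smaller than the head count.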

Conclusion: Majors, Persons or FTEs? Which is the best?

University of Utah/Office of Analysis, Assessment and Accreditation google:single multiple majors fte

Measurement of tourism activity

Who is a tourist? According to Glossary:Tourism:

Tourism means the activity of visitors taking a trip to a main destination outside their usual environment, for less than a year, for any main purpose, including business, leisure or other personal purpose, other than to be employed by a resident entity in the place visited.

According to the above definition, to be regarded as a tourist one has to change her/his place of accommodation for less than one year (otherwise Eurostat would regard her/him as a migrant).

The usual meaning (at least in Poland) is that a tourist travels for leisure, not for work. People travelling for work have other needs/aims than those travelling to rest (they usually do not use hotels, for example), so the above definition solves some problems but at the same time creates many others.

Number of tourists: does not distinguish between various types of tourists and is difficult to collect (who is a tourist anyway?)

Concept of an indicator

Various ‘number of’ counts related to tourist-oriented establishments (hotels, catering units, beds, nights spent etc.). They do not measure tourists per se but are highly related and more reliable (as they are easier to count).

Indicator of tourist activity (by various tourist types).

Conclusion: measurement of tourism activity is not trivial. Other similar cases: internet user, migrant, unemployed person, illiterate person.

NACE = Statistical Classification of Economic Activities = the industry standard classification system used in the European Union. NACE uses four hierarchical levels: SectionDivision.Group.Class, where Section is denoted by a single letter. Examples:

Sloppy measurement: measurement of tourism activity

Two sides of tourism: supply side (hotels) / demand side (tourists)

BTW: demand = how much a product/service is desired; supply = how much the market can offer

Tourism supply statistics (accommodation statistics): Data on rented accommodation ie. capacity and occupancy of tourist accommodation establishments in the reporting country. How collected? Registers?

How is statistical data collected? Exhaustive data vs sample. Exhaustive: dedicated surveys (obligatory reports) vs administrative registers (births, deaths, police statistics). Sample: representative sample vs random sample vs panel data. Panels (cf Panel Research) are overused nowadays (cf https://panelariadna.pl/):

Quirks of data collection: Data up to and including 2015 refer only to those units that submitted statistical reports. Starting with the data for January 2016, imputation was introduced (ie replacing missing data with some (possibly meaningful :-)) values). (cf BDL)
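To show what imputation changes, here is a small sketch. The numbers are invented, and median imputation is only an assumed stand-in, since the source does not say which imputation method is actually used:

```r
# Hypothetical monthly counts of nights spent reported by five hotels;
# NA marks a hotel that did not file its report
nights <- c(120, NA, 95, 210, NA)

# Approach 1: use only the units that reported
total_reported <- sum(nights, na.rm = TRUE)

# Approach 2 (assumed method): replace each missing value with the median
# of the reported values
imputed       <- ifelse(is.na(nights), median(nights, na.rm = TRUE), nights)
total_imputed <- sum(imputed)

c(reported = total_reported, imputed = total_imputed)
```

The totals are not comparable, which is why a change of method in the middle of a time series matters.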

Impu-what? Missing data problem

Tourism demand statistics: Data on participation in tourism of the residents of the reporting country. How collected? Surveys?

Most of the time, data on domestic and outbound trips (where “outbound tourism” means residents of a country travelling in another country) is collected via sample surveys (cf Annual data on trips of EU residents and Tourism_statistics_-_top_destinations)

Regulations concerning data collection in tourism (hundreds of pages): Glossary:Supply_side_tourism_statistics and EU regulation No 692/2011

So now we know what we are dealing with…

What, when and where

No doubt in every reliable survey the population has to be precisely defined ie 3 dimensions of every surveyed unit should be fixed: definition (what), time (when measured), space (where)…

I always repeat to my students: if you look at some data (in the media for example), start by establishing whether you know what, when and where. If no information (or reliable link, called a source, to such information) is provided on any of these fixed dimensions of the data, treat the data as rubbish and do not waste time using/analysing it.

Further dissemination of such defective data should be publicly prosecuted (joke).

I have already tried to show you that the what dimension is complicated and often highly unreliable/arbitrary (due to the nature of the phenomenon and/or measurement difficulties).

The when dimension is much simpler thanks to a universal standard, ie time. You gather data either for a certain moment (how many hotels were in use on 31st December 2018) or for a certain period of time (how many beds were sold in these hotels in the 3rd quarter of 2018).

The where dimension in turn is usually based on administrative or statistical (geographical) units (country, state/province, county, community). But contrary to the time dimension there is no universal or globally accepted standard for geostatistical units. Usually such a standard is based on the administrative system, which is country-dependent.

The administrative division of Poland since 1999 has been based on three levels of subdivision (cf Administrative divisions of Poland). Since 2004, when Poland became a member of the European Union, EU regulations have been part of the national legal system.

EU regulates everything, statistics included.

Conclusion: The pigs had to expend enormous labours every day upon mysterious things called “files,” “reports,” “minutes,” and “memoranda.” These were large sheets of paper which had to be closely covered with writing, and as soon as they were so covered, they were burnt in the furnace (George Orwell, Animal Farm)

NUTS and TERYT

The Nomenclature of Territorial Units for Statistics (NUTS) is a geocode standard for referencing the subdivisions of countries for statistical purposes. The standard is developed and regulated by the European Union, and thus only covers the member states of the EU in detail (cf NUTS)

The NUTS standard has been revised several times (on average every 4 years :-)), so there is even a page in the ec.europa.eu domain dedicated to the (short) history of NUTS (cf NUTS history)

NUTS1 (level) – macroregion, NUTS2 – state, NUTS3 – subregion (several counties in case of Poland)

Poland is divided into 7 macroregions, 16 states (NUTS2), and 72 subregions (NUTS3).

NUTS1 level is only for statistical purposes (but regions are in fact distinct due to history, economics, natural-conditions, cultural factors etc… )

There is a relevant and interesting page by GUS (Main Statistical Office or Główny Urząd Statystyczny), but unfortunately in Polish (use google translate :-) in case you are interested or mail me) (cf Klasyfikacja NUTS w Polsce )

The above map shows 7 macroregions (NUTS1) and 16 provinces (NUTS2). BTW province in Polish is “prowincja” (as both words come from Latin) but the Polish administrative province is actually called “województwo”, from “wodzić” – ie commanding (armed troops in this context). This is an old term/custom from the 14th century, when Poland was divided into provinces (every province ruled by a “wojewoda”, ie the chief of that province). More can be found at Wikipedia (cf Administrative divisions of Poland)

NUTS3 consists of 380 counties grouped into 72 subregions.

A Polish county (called “powiat”) is a 2nd-level administrative unit.

In ancient Poland a powiat was called “starostwo” and the head of a “starostwo” was called “starosta”. “Stary” means old, so a “starosta” is an old (and thus wise) person. BTW the head of a powiat is called “starosta” today, just as 600 years ago :-)

The 3rd level administrative unit is called “gmina” (community).

There are (approximately) 380 counties and 2750 communities in Poland.

As Poland's population is 38.5 million and its area is 312.7 thousand sq km (about 120 persons per sq km), on average each powiat covers about 820 sq km and each gmina about 113.5 sq km, with approximately 100 thousand persons per “powiat” and 14 thousand per “gmina”.
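These averages are easy to recompute. The sketch below uses only the round figures quoted in the text (38.5 million persons, 312.7 thousand sq km, 380 powiats, 2750 gminas), so the results are approximations rather than official statistics:

```r
population <- 38.5e6    # persons
area       <- 312.7e3   # sq km
n_powiat   <- 380
n_gmina    <- 2750

res <- round(c(
  density            = population / area,      # persons per sq km
  area_per_powiat    = area / n_powiat,         # sq km
  area_per_gmina     = area / n_gmina,          # sq km
  persons_per_powiat = population / n_powiat,
  persons_per_gmina  = population / n_gmina
), 1)
res
```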

TERYT is a Polish counterpart of NUTS (developed some 50 years ago). It is a complex system which includes identification of administrative units. Every unit has (up to) a 7-digit id number: wwppggt, where ww = “województwo” id, pp = “powiat” id, gg = “gmina” id and “t” encodes the type of community (rural, municipal or mixed). Higher-level units have trailing zeros in the irrelevant part of the id, so 14 and 1400000 mean the same, as do 1205 and 1205000. Six digits are enough to identify a community (approx 2750 units).
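A sketch of how such an id can be split into its parts in R. The function name parse_teryt and the sample ids are hypothetical; ids are kept as character strings, because a numeric type would drop a leading zero:

```r
# Split a TERYT-style id (wwppggt) into its components
parse_teryt <- function(id) {
  id <- as.character(id)
  # pad with trailing zeros up to 7 digits (eg 14 -> 1400000)
  id <- paste0(id, strrep("0", 7 - nchar(id)))
  data.frame(
    woj    = substr(id, 1, 2),   # ww
    powiat = substr(id, 3, 4),   # pp
    gmina  = substr(id, 5, 6),   # gg
    type   = substr(id, 7, 7),   # t
    stringsAsFactors = FALSE
  )
}

teryt <- parse_teryt(c("14", "1205", "1205000", "1205011"))
teryt
```

The second and third ids yield identical rows, which is the "trailing zeros" rule at work.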

So you are now experts on the administrative division of Poland, and we can go back to statistical charts…

Poor definitions: dreadful aggregates

Indicators can be divided into hard indicators and soft indicators. Hard indicators denote hard facts while soft indicators are beliefs and intentions. For example the number of hotels is a fact, while the intention to stay abroad for less than a year is not a fact but an intention. In Poland at least 80% of respondents declare that they intend to vote, while the true turnout never exceeds 55%. In other words, measuring something using soft indicators is prone to (significant) errors.

That does not mean that a hard indicator is error-free. By definition it measures not the phenomenon itself but some proxy associated with the phenomenon.

With a hard indicator we have a precise measurement of an imprecise measure. With a soft indicator we have an imprecise measurement of an imprecise measure.

To cure (or hide) these problems, aggregates of indicators are constructed, either as sums (indexes, or formative aggregates) or as averages (factors, or reflective aggregates). Indexes are more popular in economics while averages/factors are more popular in psychology, sociology etc…

For example Gross Domestic Product (GDP) is an index while (customer) satisfaction defined as some set of opinions on a product would be a factor.

Control question: what is measured with GDP?

Collection methods from most to least reliable:

Typical collection method description of a sample based survey: Data collected from 1st April to 2nd April 2019. Cross national panel (or sample). Respondent age +18. Panel size 1020. Quotas representative for sex, age and residence type

No information provided on non-response rate/non-contact rate (why?).

Example: How to measure illiteracy? A. Ask a straight question (can you read/write?). B. Ask how many books the respondent read last year; if zero = illiterate (nasty!). C. Ask for a certificate (infeasible). I wonder what the illiteracy rate of many countries would be if approach B were exercised :-)

Statistical analysis stage: charts

In spite of the fact that statistical charts are now ubiquitous in the media, this topic is usually covered only marginally in most courses on statistics, probably because it is pretty hard to produce quality graphics with office software (complexity vs difficulty).

Statistical charts can be plotted for the following three purposes:

Note: Some researchers recommend using charts at the data cleaning stage of statistical analysis. I do not agree. Data cleaning can be automated and should rely neither on manual work nor on visual inspection. Using programs to check data is a more efficient and reliable procedure. It is also 100% replicable, contrary to visual inspection.

A visual-art designer, not a statistician, is the right person for the 1st purpose. I am not an art designer so I will not tell you how to prepare eye-catching pictures. I am a statistician and I will concentrate on effective graphical methods for statistical explanation/exploration. By effective I mean that one (graphical) method is more effective than another if its quantitative information can be decoded more quickly/easily [Robbins 2005]

Types of charts

Some graphs are better than others:

Note: bar/line/pie charts were introduced by William Playfair in the 18th century. Dot plots were introduced by William Cleveland (1980s). Box-plots were introduced by John Tukey (1970s).

More of Playfair’s charts can be found via google or in Symanzik’s paper

Pie charts, dot plots and histograms

Nights spent by non residents

Strip charts

A strip chart (strip plot) shows the distribution of data points along a numerical axis. These plots are more suitable than box plots when sample sizes are small (because they preserve more information about the data).

Example: Number of hotels in powiat by region (NUTS1, 2017):

The biggest potential problem with a dot/scatterplot is overplotting: whenever one has more than a few points, points may be plotted on top of one another. This can severely distort the visual appearance of the plot (left panel)

There is no one solution to this problem, but there are some techniques that can help: use smaller dots, use semi-transparent dots (right panel), use jitter.

Jitter, ie a small random noise added to the data, is shown below (higher jitter in the right panel)
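What jitter does to the data can be sketched with base R's jitter() function. The data below are invented, and the two noise amounts (0.05 and 0.30) are arbitrary choices for illustration:

```r
set.seed(1)                   # make the random noise reproducible
# Hypothetical discrete data: many identical points overplot each other
x <- rep(1:3, each = 50)      # group
y <- rep(c(5, 5, 6), 50)      # measurement

y_small <- jitter(y, amount = 0.05)   # little noise
y_large <- jitter(y, amount = 0.30)   # more noise

# number of distinct plotting positions before and after jittering
c(raw      = nrow(unique(cbind(x, y))),
  jittered = nrow(unique(cbind(x, y_small))))
```

In ggplot2 a similar effect can be obtained with geom_jitter(), whose width and height arguments control the amount of noise.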

Histograms and kernel density functions

Histograms show the distribution of a set of data. To draw a histogram the numbers (observations) are grouped into bins (intervals or classes). There is a trade-off between showing details and showing an overall picture. When the bin width changes, the scale of the Y-axis changes as well (more bins, fewer observations in each bin). Example: number of hotels in Poland (2017):

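library(ggplot2)  # provides ggplot() and geom_histogram()
# the number of bins is chosen with Sturges' rule
ggplot(d, aes(x = hotele2017)) +
  geom_histogram(bins = nclass.Sturges(d$hotele2017))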

Histograms with binwidth equal to 20, 10, 5 and 1 respectively:

Drawback of histogram: scale is bin (width) dependent.
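The dependence on bin width can be checked numerically without drawing anything. The sketch below uses simulated right-skewed data (not the hotel data) and two arbitrary bin widths, 20 and 5:

```r
set.seed(42)
x <- rexp(380, rate = 1 / 10)   # hypothetical right-skewed data, 380 units

k <- nclass.Sturges(x)          # Sturges' rule: ceiling(log2(n) + 1)
k

# Same data, two bin widths: the counts (Y-axis scale) change with the width
h_wide   <- hist(x, breaks = seq(0, ceiling(max(x) / 20) * 20, by = 20), plot = FALSE)
h_narrow <- hist(x, breaks = seq(0, ceiling(max(x) / 5) * 5, by = 5), plot = FALSE)

c(max_count_wide = max(h_wide$counts), max_count_narrow = max(h_narrow$counts))
```

Both histograms contain all 380 observations, but the tallest bar is lower when the bins are narrower.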

Kernel density functions

ggplot(data=d) + geom_density(aes(x=hotele2017))

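# adjust multiplies the default bandwidth: small values give a wiggly curve, large values a smooth one
p1 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=0.25)
p2 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=1.0)
p3 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=2.0)
p4 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=8.0)
ggarrange(p1,p2,p3,p4) # ggarrange() is not part of ggplot2; it comes eg from the ggpubr package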
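The role of the adjust parameter can be verified with base R's density() function, which (as far as I know) is what geom_density() relies on. The data are simulated, not the hotel data:

```r
set.seed(7)
x <- rexp(380, rate = 1 / 10)    # hypothetical skewed data

bw0 <- bw.nrd0(x)                # default (rule-of-thumb) bandwidth
adj <- c(0.25, 1, 2, 8)          # the same multipliers as in the plots above
bws <- sapply(adj, function(a) density(x, adjust = a)$bw)

rbind(adjust = adj, bandwidth = bws, ratio = bws / bw0)
```

The bandwidth actually used is simply the default bandwidth multiplied by adjust.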

Comparing distributions: box-plots

Box-plots are much better than histograms for comparing the distributions of more than one data set.

Construction of a (typical) box-plot: the middle bar is the median. The bottom/top edges of the rectangle are the 1st and 3rd quartiles, so the height of the rectangle is the IQR (interquartile range). The fanciful bars below/above the rectangle, called whiskers (google: whiskers mustache :-)), extend up to 1.5 times the IQR from the rectangle (or only to the minimum/maximum if those values lie closer than 1.5 IQR). The symbols below/above the whiskers (usually open circles) are outliers (non-typical/extreme values).

Note the trick: outliers are defined not as (for example) the top/bottom 1% fraction of values (every distribution would have outliers in such a case) but as values less than Q1 - 1.5 IQR or more than Q3 + 1.5 IQR (so distributions with moderate variability would not have outliers)
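The construction can be reproduced step by step on a small invented sample. Note that R's own boxplot.stats() uses Tukey's hinges instead of quantile(), so its box edges may differ slightly from the ones computed here:

```r
# Hypothetical sample with one extreme value
x <- c(2, 3, 4, 5, 5, 6, 7, 8, 9, 30)

q   <- quantile(x, c(0.25, 0.5, 0.75))   # Q1, median, Q3
iqr <- unname(q[3] - q[1])
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# whiskers end at the most extreme observations inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])
outliers <- x[x < lower_fence | x > upper_fence]

list(quartiles = q, whiskers = whiskers, outliers = outliers)
```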

Example: age of Nobel-prize winners (cf The Nobel Prize API Developer Hub)

nlf <- read.csv("nobel_laureates3.csv", sep = ";", dec = ",", header = TRUE, na.strings = "NA")

# one box per prize category, age in years on the Y-axis
ggplot(nlf, aes(x = category, y = age, fill = category)) + geom_boxplot() + ylab("years") + xlab("")
## Warning: Removed 39 rows containing non-finite values (stat_boxplot).

Multiple histograms are too detailed (binwidth=5). It is impossible, for example, to establish which category has the youngest laureates (on average), or which category has the oldest ones (economics and literature are candidates, but due to the multimodality of the literature laureates’ distribution it is difficult to assess this for sure…)

Comparing distributions: box-plots vs multiple histograms

Number of hotels in powiat by województwo (2017):

More jitter:

Boxplots are better:

Scatter-plots

A scatter-plot (aka scatter diagram, xyplot) is the basic chart form for two (quantitative) variables.

To see the relationship between variables, a line can be fitted. The least squares (LS) line, which assumes a linear relationship between the variables, is fitted by minimizing the sum of squares of the residuals (a residual is the difference between a data point and the corresponding point on the line, ie the point computed from the formula y = a + bx, where x is the value of the x-axis variable).
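A short sketch of the LS fit on made-up data, comparing the textbook closed-form estimates of a and b with what lm() returns:

```r
# Hypothetical data: y depends roughly linearly on x
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

# closed-form least-squares estimates for y = a + b*x
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

fit <- lm(y ~ x)            # the same line fitted by R
rbind(by_hand = c(a, b), lm = unname(coef(fit)))

sum(residuals(fit)^2)       # the minimised sum of squared residuals
```

Any other choice of a and b gives a larger sum of squared residuals for these data.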

(Almost) every part of Poland is attractive for tourists, but those counties which are at the seaside (north) or in the mountains (south) are special. There are 11 counties at the seaside (morze = sea) and 18 in the mountains (góry):

## 
## Call:
## lm(formula = tz2017 ~ y2017, data = m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -127494  -22228    4466   20779   84551 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -52055      45219  -1.151   0.2793  
## y2017           5839       2324   2.513   0.0332 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60310 on 9 degrees of freedom
## Multiple R-squared:  0.4123, Adjusted R-squared:  0.347 
## F-statistic: 6.314 on 1 and 9 DF,  p-value: 0.03316
## 
## Call:
## lm(formula = tz2017 ~ y2017, data = m)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -26224  -5432   2165   5616  25199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11071.5     5820.8  -1.902 0.086330 .  
## y2017          961.7      161.4   5.960 0.000139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13270 on 10 degrees of freedom
## Multiple R-squared:  0.7803, Adjusted R-squared:  0.7584 
## F-statistic: 35.52 on 1 and 10 DF,  p-value: 0.0001394

So each new hotel in the mountains would attract on average 961.7 foreign tourists, while a new hotel at the seaside would attract 5839 foreign tourists (and both coefficients are statistically significant at \(\alpha=0.05\) :-) )

Alternatively a loess curve can be used, which does not assume linearity, but its parameters are not interpretable.
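A sketch of the difference on simulated, deliberately non-linear data. The span of 0.5 is an arbitrary choice, not a recommendation:

```r
set.seed(3)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)   # hypothetical, clearly non-linear data

lin <- lm(y ~ x)                     # straight line: intercept and slope
lo  <- loess(y ~ x, span = 0.5)      # local regression: no global slope to report

# residual spread: the loess curve follows these data more closely
c(lm = sd(residuals(lin)), loess = sd(residuals(lo)))
```

The price is that there is no single coefficient to interpret, as there was for the hotels above.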

Scales

A logarithmic scale makes it possible to plot values whose range is too wide for a linear scale. Base 10 logarithms ‘squeeze’ the numbers more than base 2 logarithms (log10(100)=2 while log2(100)=6.64). Moreover, if the original scale contains powers of 10, use log10 to get a ‘nice’ log-scale, while if it contains powers of 2, use log2.
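A quick check of these numbers in R, on an arbitrary vector of powers of 10:

```r
x <- c(1, 10, 100, 1000)
data.frame(x = x, log10 = log10(x), log2 = round(log2(x), 2))

# equal ratios become equal distances on a log scale
steps <- diff(log10(x))   # each step multiplies x by 10
steps
```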

Logarithms transform an additive scale into a ‘multiplicative’ one. Example (Nobel prize again):

dA <- read.csv("nobel_laureates3.csv", sep = ";", dec = ",", header = TRUE, na.strings = "NA")
nrow(dA)
## [1] 934
dS <- subset(dA, bornCountryCode != "") # keep laureates with a known country of birth
nrow(dS) # how many
## [1] 901

Next, aggregate by bornCountryCode.
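One way to do such an aggregation in R is table(). The sketch below uses a short invented vector in place of the real column (which would be dS$bornCountryCode), so the counts are not real:

```r
# Hypothetical stand-in for the bornCountryCode column
codes <- c("US", "GB", "US", "PL", "DE", "US", "GB", "PL")

counts <- sort(table(codes), decreasing = TRUE)   # laureates per country
counts
```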

Finally plot the resulting data using various Y-axis scales (arithmetic, log2 and log10)

The exact figures are as follows:

## 
##      AR  AT  AU  AZ  BA  BD  BE  BG  BR  BY  CA  CH  CL  CN  CO  CR  CY 
##   0   4  17  10   1   2   1   9   1   1   4  19  17   2  12   2   1   1 
##  CZ  DE  DK  DZ  EG  ES  FI  FR  GB  GH  GP  GR  GT  HR  HU  ID  IE  IL 
##   6  82  12   2   6   7   5  55 100   1   1   1   2   1   9   1   5   6 
##  IN  IR  IS  IT  JP  KE  KR  LC  LR  LT  LU  LV  MA  MG  MK  MM  MX  NG 
##   8   2   1  19  26   1   2   2   2   3   2   1   1   1   1   1   3   1 
##  NL  NO  NZ  PE  PK  PL  PT  RO  RU  SE  SI  SK  TL  TR  TW  UA  US  VE 
##  18  12   3   1   3  25   2   4  26  29   1   1   2   3   1   5 269   1 
##  VN  YE  ZA  ZW 
##   1   1   9   1

Graphic perception tasks

From the best to the worst:

Angle judgement is not precise. Acute angles are underestimated while obtuse angles (greater than 90 degrees) are overestimated.

Area judgement is biased as well. It is impossible to distinguish small differences in area, while it is quite easy when the same data is plotted along a common scale.

The most accurate graphic task is positioning along a common scale.

General design rules

Ink in this definition refers to the non-erasable ink used for the presentation of data. If the data-ink were removed from the image, the graphic would lose its content. Non-data-ink is accordingly the ink that does not carry information but is used for scales, labels and edges.

Good graphics should include only data-ink. Non-data-ink is to be deleted wherever possible. The reason for this is to avoid drawing the viewers’ attention to irrelevant elements of the data presentation. There is a short and excellent video clip on YouTube which illustrates this rule.

Data to Ink Ratio

Lie factor (LF) is a ratio as well, but defined as the size of the effect shown in the graphic divided by the size of the effect in the data. Preferably LF should equal 1 (100%). According to Tufte, LF greater than 1.05 or less than 0.95 signals significant distortion. This rule can be best explained with an example.

Lie factor example

This giant guy (GG) in the middle is our ex-president. The guy next to him on the left is our current president Duda. Next to Duda is the ex-rock star Kukiz, the dark horse of the elections. This is the cover (slightly modified) of an influential Polish weekly magazine from May 2015, shortly before the elections.

The figures are claimed to be in sync with recent survey results (sort of a barchart). Could you figure out from that chart the proportions of the candidates’ scores? By how much does the giant guy outperform the runner-up candidate? Which candidate is supported by this influential magazine (easy :-))?

The lie-factor details:

The line from the shoes to the top of the head equals (at a certain size of course) 204mm for GG, 134mm for Duda and 42.5mm for the ex-rock star. So \(204/134 \approx 1.52\) and \(204/42.5 = 4.8\). As \(44/29 \approx 1.52\) and \(44/9 \approx 4.89\) as well, formally the lie factor is (almost) perfect. But should one compare lengths or areas?

If one compares areas, not heights, one gets significantly different (and correct) results, namely: \((204 \times 58)/(134 \times 21) \approx 4.20\) and \((204 \times 58)/(42.5 \times 15) = 18.56\). The lie factor is \(4.20/1.52 \approx 2.8\) (about 280%) and \(18.56/4.89 \approx 3.8\) (about 380%) respectively. Huge distortion.
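The same arithmetic as a short R sketch, using only the measurements and survey scores quoted above (the variable names are mine). Small differences from the figures in the text come from rounding of the intermediate ratios:

```r
# heights and widths in mm, survey scores in %
height <- c(GG = 204, Duda = 134, Kukiz = 42.5)
width  <- c(GG = 58,  Duda = 21,  Kukiz = 15)
score  <- c(GG = 44,  Duda = 29,  Kukiz = 9)

others <- c("Duda", "Kukiz")
effect_data   <- score[["GG"]] / score[others]     # effect in the data
effect_height <- height[["GG"]] / height[others]   # effect shown by heights
area          <- height * width
effect_area   <- area[["GG"]] / area[others]       # effect shown by areas

# lie factor = effect shown in the graphic / effect in the data
lf <- round(rbind(lf_height = effect_height / effect_data,
                  lf_area   = effect_area / effect_data), 2)
lf
```

Judged by heights the lie factor stays within Tufte's 0.95 to 1.05 band, while judged by areas it is far outside it.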

Moreover two more tricks were applied to boost GG. Can you see them?

BTW: the text in the pink frame claims: “figure ratios are consistent with april-may survey outcome.” (But what exactly does “figure ratios” mean?)

Banking to 45

The ratio between the width and the height of a rectangle is called its aspect ratio.

The aspect ratio describes the area that is occupied by the data in the chart. A change in aspect ratio changes the perception of the graph. The question is which aspect ratio is the best.

We can recognize change most easily if the absolute slopes are close to a 45 degree angle on the graph. It is much harder to see change if the curves are nearly horizontal/vertical. The idea behind banking (Cleveland, 1988) is therefore to adjust the aspect ratio of the entire plot in such a way that most slopes are at an approximate 45 degree angle.

Setting the aspect ratio so that the average of the values of the orientations is 45 degrees is called “banking the average orientation to 45 degrees”.

Setting the aspect ratio so that the weighted mean of the orientations of the line segments (weighted by the segments’ lengths) is approx 45 degrees is called the average weighted orientation method (to 45 degrees).
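A simplified sketch of the unweighted variant (average absolute orientation) in R. The series is invented, segment lengths are ignored, and the aspect ratio a is understood as the plot's height divided by its width, so this is an illustration of the idea rather than Cleveland's exact procedure:

```r
# Hypothetical yearly series
x <- 1:10
y <- c(1, 3, 2, 6, 5, 9, 8, 12, 11, 15)

# slopes of the line segments after rescaling both axes to [0, 1]
s <- (diff(y) / diff(range(y))) / (diff(x) / diff(range(x)))

# mean absolute orientation (in radians) for aspect ratio a = height / width
mean_orientation <- function(a) mean(atan(a * abs(s)))

# find the aspect ratio for which the mean absolute orientation is 45 degrees
a45 <- uniroot(function(a) mean_orientation(a) - pi / 4,
               interval = c(1e-3, 1e3), tol = 1e-8)$root
a45
mean_orientation(a45) * 180 / pi   # check: should be close to 45
```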

Exercise: assess which slope is the steepest one and which is the smallest one?

BTW: every chart presents the same data on CO2 emission (average for May each year) as provided by US Government’s Earth System Research Laboratory, Global Monitoring Division. (cf CO2 PPM - Trends in Atmospheric Carbon Dioxide)

Elementary spatial analysis (Heat maps/choropleth maps)

I do not intend to give you a full lecture on spatial methods/analysis now. First of all, I am not an expert in this area. Second, most of the methods developed by cartographers are not used in the domain of social sciences. But a few are popular, simple and pretty impressive (to family and friends (F&F) at least, ie to non-professionals). Why not use them? (To impress your F&F and/or your boss?) These methods are:

A choropleth map is a thematic map where geographic regions are colored, shaded, or patterned in relation to a value.

A feature map is a map augmented with the (somehow marked) positions of objects of interest.

A heat map represents the intensity of objects occurrence within a dataset. A heatmap uses color to represent intensity, though unlike a choropleth map, a heatmap does not use geographical or geo-political boundaries to group data. This technique requires point geometries, as you are looking to map the frequency of an occurrence at a specific point.

One can think of choropleth map as a kind of spatial histogram, while feature map (heat map) is a kind of spatial dot-plot (fanciful spatial dot-plot).
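The "spatial histogram" analogy can be made concrete: before colouring, the values of the regions are grouped into classes, just like observations into bins. A sketch with invented values and quartile breaks (only one of several possible classification schemes):

```r
# Hypothetical values for eight regions
value <- c(r1 = 12, r2 = 45, r3 = 7, r4 = 88, r5 = 30, r6 = 19, r7 = 61, r8 = 25)

# quartile class breaks, then one colour class (1 to 4) per region
breaks <- quantile(value, probs = seq(0, 1, by = 0.25))
klass  <- cut(value, breaks = breaks, include.lowest = TRUE, labels = FALSE)

table(klass)   # how many regions fall into each colour class
```

As with histogram bins, a different choice of breaks can change the look of the map considerably.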

Spatial analysis tools

Google Geoservices are no longer free if used not “directly” but via the API. One has to register a credit/debit card and sign some obscure license to use them. I used Google for years but stopped using it last year.

Google has shut down some cool geo-services including Google Fusion Tables (launched 10 years ago). I was a big fan of GFT and I am greatly disappointed by the decision to shut it down now.

QGIS is a full-featured, mature (since 2002) and powerful open source geographic information system (GIS).

It allows one to analyze and edit spatial information, in addition to composing and exporting graphical maps. QGIS supports both raster and vector layers; vector data is stored as either point, line, or polygon features. Multiple formats of raster images are supported, and the software can georeference images.

QGIS supports shapefiles, PostGIS (ie the most important ones), and other formats. Web services, including Web Map Service and Web Feature Service, are also supported to allow use of data from external sources (Open Street Map for example).

QGIS integrates with other open-source GIS packages, including PostGIS, GRASS GIS, and MapServer. Plugins written in Python or C++ extend QGIS’s capabilities.

To start with QGIS simply go to www.qGIS.org, download it and start the installation. There is no need to learn the whole system. Being (somewhat) acquainted with the Project and Layer/Add Layer menu items is enough:

Example: Poland (population, incomes, distribution of)

The CSV file PL_powiaty_2017.csv, which I compiled for this lecture, contains cross-sectional data for every Polish powiat (generally for 2017). Among other things one can find there:

d <- read.csv("PL_powiaty_2017.csv", sep = ";", header = TRUE, na.strings = "NA")

revF <- fivenum(d$przychodMF)
revM <- mean(d$przychodMF, na.rm=T)
revD <- sd(d$przychodMF, na.rm=T)

c(revF, revM, revD)
## [1]  90.39000 150.66500 183.42000 235.77500 679.11000 202.47941  77.62899
ggplot(d, aes(x = przychodMF)) + geom_histogram(bins = nclass.Sturges(d$przychodMF))

ggplot(d, aes(x = przychodMF)) + 
  geom_histogram(binwidth = 40) # about 10 USD (as of March 2019)

So on average the revenues were 202.48 złotych with a relative dispersion of 38.34% (fortunately Poland did not “join the Euro area”, and we still use a local currency called złoty; BTW złoty literally means “made of gold”). Half of the powiats’ revenues were between roughly 150.7 and 235.8 złoty or PLN (Q1/Q3), with a minimum of 90.39 PLN and a maximum of 679.11 PLN.
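The relative dispersion (coefficient of variation) quoted above is simply the standard deviation divided by the mean. A sketch using the two values printed in the output earlier:

```r
revM <- 202.47941   # mean, as printed above
revD <- 77.62899    # standard deviation, as printed above

cv <- revD / revM   # coefficient of variation (relative dispersion)
sprintf("%.2f%%", 100 * cv)
```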

To understand the spatial distribution of wealth one can plot a choropleth map (using QGIS, not R):

Number of high schools

Number of hotels

Population density (number of people per square kilometer)

More examples can be found at my Github account (URL will be provided on the last slide)

Example: UNESCO World Heritage List

A World Heritage (WH) Site is a place that is listed by the United Nations Educational, Scientific and Cultural Organization (UNESCO) as having special cultural or physical significance. There is an inventory of WH sites at whc.unesco.org. This list is available in various formats, including Excel, and when rendered with QGIS looks like:

Heatmaps show density more clearly (or not – opinions are contradictory):

Example: Nobel prize winners by place of birth

Remember Nobel winners by country? With heat-maps one can plot them on the map:

Example: Concentration of Polish big industry

Every year Rzeczpospolita, a nationwide daily economic and legal newspaper, compiles a list of the 2000 biggest companies (an idea similar to the Fortune 500 list). The distribution is highly skewed and concentrated, but what about the spatial distribution of Polish big companies? Feature/heat maps to the rescue…

BTW: the small picture in the middle depicts Poland during Weichselian and WĂĽrm cold period (15,000–11,700 years ago, cf Weichselian glaciation)

Crusaders, Knights and Malbork castle

First a short explanation of the subject of the analysis, i.e. the famous Castle of the Teutonic Order in Malbork, which is inscribed on the UNESCO World Heritage List (cf UNESCO heritage list):

Several religious military orders were formed in the Holy Land during the Crusades: Templars, Hospitallers, Teutonic Knights.

The Teutonic Knights, or the Teutonic Order of the Hospital of St. Mary in Jerusalem, were known in Poland as Krzyżacy on account of the black cross they wore on their white coats.

Established in 1190 to protect German pilgrims in the Holy Land, the order was later transformed in order to fight heretics.

In 1226 the Teutonic Knights came to Poland, invited by Duke Konrad I of Mazovia to fight the annoying pagan Prussian tribes invading Poland from the north from time to time. The Teutonic Knights conquered Prussia, exterminated the locals and founded a powerful state with Malbork (Marienburg, or Mary’s castle, in German) as its capital.

BTW: Kwidzyn in German is called Marienwerder (Mary’s meadow) and there were a lot more places named Marien-something (as Marien is St Mary in German)

BTW2: It is about 40 km from Kwidzyn to Malbork :-)

Examples of very bad graphs

There is a peer-reviewed research paper on tourist traffic in the castle museum in Malbork:

The determinants of the tourist traffic in the castle’s museum of Malbork

Unfortunately all charts in this paper contain elementary errors. Could you identify them?

If one insists on using pie charts (improved version):

or better, using bar/dot charts:

Even worse graphics (yes we can:-) )

Piecharts are notorious for obscurity:

What about this bar chart (distribution of seats in the Polish parliament (Sejm) after the 2015 elections; a 50% majority requires 231 of 460 seats)?

Remember the dark-horse ex-rock star Kukiz? IMO his bar does not look as if it were equal to 50 votes (minus 1). The PO bar is peculiar as well…

Not to mention the strange tilt to the left…

There is a life without spreadsheet too: R and RStudio

R is both a programming language for statistical computing and graphics and a piece of software (i.e. an application) that executes programs written in R. R was developed in the mid-1990s at the University of Auckland (New Zealand).

Since then R has become one of the dominant software environments for data analysis and is used by a variety of scientific disciplines.

BTW, why such a strange name (R)? A long time ago it was popular to use short names for computer languages (C, for example). In the mid-1970s a language oriented towards statistical computing was developed at AT&T Bell Labs (by John Chambers) and called S (from Statistics). R is the letter just before S in the alphabet.

RStudio is an environment through which to use R. In RStudio one can simultaneously write code, execute it, manage data, get help and view plots. RStudio is a commercial product distributed under a dual-license system by RStudio, Inc. A key developer of RStudio is Hadley Wickham, another brilliant New Zealander (cf Hadley Wickham)

Microsoft has recently invested heavily in R development. It bought Revolution Analytics, a key developer of R and a provider of commercial versions of the system. With MS support the system is expected to gain more popularity (for example through integration with popular MS products)

Measures of central tendency, dispersion and skewness with R

(univariate analysis)

The CSV file hotele_caloroczne_PL.csv contains data on the number of all-season hotels in every county in Poland. First one has to load the dataset with the read.csv command:

d <- read.csv("hotele_caloroczne_PL.csv", sep = ';', header = TRUE, na.strings = "NA")

Computing measures of central tendency (with summary and/or fivenum)

summary(d)
##      teryt                powiat      hotele2012        hotele2017    
##  Min.   : 201   bielski      :  2   Min.   :  0.000   Min.   :  0.00  
##  1st Qu.:1005   brzeski      :  2   1st Qu.:  3.000   1st Qu.:  4.00  
##  Median :1636   grodziski    :  2   Median :  5.000   Median :  7.00  
##  Mean   :1721   krośnieński  :  2   Mean   :  8.776   Mean   : 10.31  
##  3rd Qu.:2475   nowodworski  :  2   3rd Qu.: 10.000   3rd Qu.: 11.00  
##  Max.   :3263   opolski      :  2   Max.   :158.000   Max.   :183.00  
##                 (Other)      :368   NA's   :1
fivenum(d$hotele2017)
## [1]   0   4   7  11 183

Computing mean:

mean(d$hotele2017)
## [1] 10.31053

And dispersion:

var(d$hotele2012); var(d$hotele2017)
## [1] NA
## [1] 244.8743
sd(d$hotele2012); sd(d$hotele2017)
## [1] NA
## [1] 15.64846

Second attempt (no output this time; the respective values are saved as the variables var12, var17, sd12 and sd17):

var12 <- var(d$hotele2012, na.rm = TRUE); var17 <- var(d$hotele2017, na.rm = TRUE)
sd12 <- sd(d$hotele2012, na.rm = TRUE); sd17 <- sd(d$hotele2017, na.rm = TRUE)
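The first attempt returned NA for 2012 because that column contains one missing value, and in R a missing value propagates through `var` and `sd` unless it is removed. A small sketch of this behaviour follows; the vector `h` is hypothetical and stands in for the real column.

```r
# Hypothetical hotel counts with one missing value (not the real data)
h <- c(3, 5, 10, NA, 8)

sum(is.na(h))          # how many values are missing
var(h)                 # NA: the missing value propagates
var(h, na.rm = TRUE)   # variance of the four observed values
sd(h, na.rm = TRUE)    # square root of the line above
```

Counting the missing values first is a cheap way to decide whether dropping them with `na.rm = TRUE` is acceptable.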

BTW:

c( mean(d$hotele2012, na.rm=T), mean(d$hotele2017, na.rm=T))
## [1]  8.775726 10.310526

Or, more formally: on average there were 8.78 hotels per county in Poland in 2012, while in 2017 there were 10.31.

The interquartile range (IQR) is the difference between the upper (75%) quartile and the lower (25%) quartile. The IQR covers the central 50% of observations. It is a robust measure of dispersion, largely unaffected by extreme values:

c( IQR(d$hotele2012, na.rm=T), IQR(d$hotele2017, na.rm=T))
## [1] 7 7
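To illustrate why the IQR is called robust, here is a minimal sketch comparing it with the standard deviation when a single extreme value is introduced. Both vectors are invented for the illustration and are not related to the hotel data.

```r
# Hypothetical data: y equals x except that the largest value becomes an outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
y <- c(x[-9], 900)

c(IQR(x), IQR(y))   # quartile-based measure: identical for both vectors
c(sd(x), sd(y))     # moment-based measure: strongly inflated by the outlier
```

The quartiles depend only on the ordering of the central observations, so changing the extremes leaves them untouched.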

Finally we can equally easily assess the skewness:

library(moments)
c(skewness(d$hotele2012, na.rm=T), skewness(d$hotele2017))
## [1] 5.998884 5.899827

Distribution skewness is significant in both periods. Using (modified) Pearson’s formula \((\bar x - D)/\sigma\), where \(D\) is the mode, we obtain:

library("DescTools")
(mean(d$hotele2017) - Mode(d$hotele2017) )/ sd17  
## [1] 0.4032682

Still the distribution is positively skewed, but the value of the coefficient is much smaller.
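The mode-based coefficient can also be computed without an extra package. The sketch below uses a small base-R helper; the name `pearson_mode_skew`, the tie-breaking rule (smallest of several modes) and the data are assumptions made for illustration only.

```r
# Mode-based skewness coefficient: (mean - mode) / standard deviation
pearson_mode_skew <- function(v) {
  v <- v[!is.na(v)]
  counts <- table(v)
  # If several values share the highest frequency, the smallest is used (an assumption)
  mode_value <- as.numeric(names(counts)[which.max(counts)])
  (mean(v) - mode_value) / sd(v)
}

x <- c(0, 1, 1, 1, 2, 2, 3, 5, 12)   # hypothetical right-skewed counts
pearson_mode_skew(x)
```

A positive result means the mean lies above the mode, which is the typical picture for a right-skewed distribution.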

New workflow: reproducible research

Sorry, but why use all this strange stuff at all? I will present the most important argument in a moment; it concerns the basic approach to doing statistical analysis.

This mode (or concept) is called Reproducible Research (RR in short).

Serious statistical analysis is not a one-off job. There is a value chain as well as a life cycle of statistical analysis.

Value chain means that there are distinct stages, while life cycle means that the same data/models are used for years: most statistical analyses do not start from scratch but are based on data from the past augmented with new data.

The problem is that the new data and model modifications should be in-sync with the past.

To make the problem worse, serious statistics should also be in sync with the work of others (to ease, or to make possible at all, any meaningful (international) comparisons, for example)

Reproducible research or how to make statistical computations more meaningful

Abandoning the habit of secrecy in favor of process transparency and peer review was the crucial step by which alchemy became chemistry. Eric S. Raymond, The Art of UNIX Programming, Addison-Wesley.

Replicability vs Reproducibility

Hot topic: a Google search for “reproducible research” returns about 158,000 hits.

Replicability: an independent experiment targeting the same question will produce a result consistent with the original study.

Reproducibility: ability to repeat the experiment with exactly the same outcome as originally reported [description of method/code/data is needed to do so].
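In computational work, exact reproducibility usually requires that code, data and any source of randomness are all fixed. A minimal sketch of the last point follows, using simulated data; the function name `run_analysis` and the seed value are arbitrary choices made for this illustration.

```r
# A toy "analysis" on simulated data; the seed makes the random draw repeatable
run_analysis <- function(seed) {
  set.seed(seed)                                  # fix the random number generator state
  sample_data <- rnorm(100, mean = 10, sd = 2)    # simulated observations
  mean(sample_data)
}

first  <- run_analysis(2019)
second <- run_analysis(2019)
identical(first, second)   # TRUE: same code + same seed give the same result
```

Without the `set.seed` call the two runs would, in general, report different means, and a reader could not verify the published number.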

Computational science is facing a credibility crisis: it’s impossible to verify most of the computational results presented at conferences and in papers today. (Donoho D. et al 2009)

Australopithecus (Current practices)

Use Excel for data cleaning & descriptive statistics.

Excel handles missing data inconsistently and sometimes incorrectly.

Many common functions are poor or missing in Excel.

Use SPSS/SAS/Stata in point-and-click mode to run serious statistical analyses.

Problems

Tedious/time-wasting/costly.

Even small data/method change requires extensive recomputation effort/careful report/paper revision and update.

Error-prone: difficult to record/remember a ‘click history’.

Famous example: the Reinhart and Rogoff controversy. Their claim: countries with a very high debt-to-GDP ratio suffer from low growth. However, the study suffers from serious but easily identifiable flaws, which were discovered when Reinhart and Rogoff released the dataset they had used in their analysis (cf Growth_in_a_Time_of_Debt)

Homo habilis (Enhanced current practices)

Benefits

Improved: reliability, transparency, automation, maintainability. Lower costs (in the long run).

Solves 1–2 but not 3–4.

Problems: Steeper learning curve. Perhaps higher costs in short run. Duplication of effort (or mess if scripts/programs are poorly documented).

Homo Erectus (Literate statistical programming)

Literate programming concept: Code and description in one document. Create software as works of literature, by embedding source code inside descriptive text, rather than the reverse (as in most programming languages), in an order that is convenient for human readers.

A program is like a WEB, tangled and woven (turned into a document), with relations and connections between the program parts. We express a program as a web of ideas. WEB is a combination of a document formatting language and a programming language.

The general idea of Literate statistical programming mimics Knuth’s WEB system.

Statistical computing code is embedded inside descriptive text. A literate statistical program is woven (turned) into a report/paper by executing the code and inserting the results obtained, so the report can be regenerated after data/method changes.

Solves 1–4.
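The weaving step can be imitated in a few lines of base R. This is only a toy sketch of the idea, not how Sweave or knitr are implemented: the `{{ ... }}` placeholder syntax, the function name `weave` and the data are all assumptions made for the illustration.

```r
# Toy "weave": replace every {{ R expression }} placeholder with its evaluated result
weave <- function(template, env = parent.frame()) {
  pattern <- "\\{\\{(.*?)\\}\\}"
  matches <- regmatches(template, gregexpr(pattern, template, perl = TRUE))[[1]]
  for (m in matches) {
    code  <- sub(pattern, "\\1", m, perl = TRUE)     # text between the braces
    value <- eval(parse(text = code), envir = env)   # run the embedded code
    template <- sub(m, format(value), template, fixed = TRUE)
  }
  template
}

hotels <- c(4, 7, 11, 2, 30)   # hypothetical data
report <- "Median number of hotels: {{ median(hotels) }}; n = {{ length(hotels) }}."
weave(report)
```

If `hotels` changes, calling `weave(report)` again yields an updated sentence with no manual copying of numbers, which is the point of the literate approach.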

LSP: Benefits/Problems/Tools

Problems of LSP: Many incl. costs and learning curve

Tools:

Github for the uninitiated

The basic idea is that instead of manually recording the changes one has made to data, documents etc., one can use software to help manage the whole process. Such software is called a Version Control System (VCS).

A VCS not only manages content, registering each modification of it, but controls access to the content as well. Thus many individuals can work on a common project (compare this to the common scenario of mailing spreadsheets to each other: highly inefficient, to say the least)

There are highly reliable and publicly available VCS services and GitHub is the most popular of them.

GitHub is owned by Microsoft (do not use if you boycott MS :-))

I use GitHub as an educational tool: to distribute learning content to my students and to store content they produce for me (ie projects)

The free GitHub account is public. It is OK for me. If it is not OK for you, you can buy a license for a commercial account or not use GitHub at all.

Summary: New Tools (hipster part)

Summary: New practice, learning resources and data banks

New practice

Learning resources

Data banks

Geo resources

Questions?